ABSTRACT

Cascades represent an important phenomenon across various
disciplines such as sociology, economics, psychology, political
science, marketing, and epidemiology. An important
property of cascades is their morphology, which encompasses
their structure, shape, and size. However, cascade morphology
has not been rigorously characterized and modeled in
prior literature. In this paper, we propose a Multi-order
Markov Model for the Morphology of Cascades (M^4C) that
can represent and quantitatively characterize the morphology
of cascades with arbitrary structures, shapes, and sizes.
M^4C can be used in a variety of applications to classify
different types of cascades. To demonstrate this, we apply 
it to an unexplored but important problem in online social 
networks: cascade size prediction. Our evaluations using
real-world Twitter data show that the M^4C based cascade size
prediction scheme outperforms the baseline scheme based
on cascade graph features such as edge growth rate, degree
distribution, clustering, and diameter. The M^4C based cascade
size prediction scheme consistently achieves more than 90%
classification accuracy under different experimental scenarios.

1. INTRODUCTION

1.1 Background and Motivation 

The term cascade describes the phenomenon of something 
propagating along the links in a social network. That some- 
thing can be information such as a URL, action such as a 
monetary donation, influence such as buying a product, dis- 
cussion such as commenting on a blog article, and a resource 
such as a torrent file. Based on what is being propagated, we
can categorize cascades into various classes such as information
cascades [?], action cascades [?], influence cascades [20],
and resource cascades [33]. Consider
a toy example where user A, connected to users B and
C in a social network, broadcasts a piece of information (e.g. 
a picture or a news article) to his neighbors. Users B and 
C, after receiving it from user A, may further rebroadcast 
it to their neighbors resulting in the formation of a cascade. 

The cascade phenomenon has been a fundamental topic in
many disciplines such as sociology, economics, psychology,
political science, marketing, and epidemiology, with research
literature tracing back to the 1950s. A key challenge
in these studies is the lack of large scale cascade data.
As online social networks have recently become a primary
way for people to share and disseminate information, the
massive amount of data available on these networks provides
unprecedented opportunities to study cascades at a large
scale. Recent events, such as the Iran election protests, Arab
Spring, Japanese earthquake, and London riots, have been
significantly impacted by campaigns via cascades in online
social networks [36] [27] [?]. Studying cascades in online
social networks will benefit a variety of domains such as social
campaigns [36], product marketing and adoption [25], online
discussions [13], sentiment flow [26], URL recommendation
[29], and meme tracking [14].



1.2 Problem Statement 

The goal of this paper is to study the morphology of cascades
in online social networks. Cascade morphology encompasses
many aspects of cascades such as their structures,
shapes, and sizes. Specifically, we aim to develop a model
that allows us to represent and quantitatively characterize
cascade morphology, two tasks that are extremely difficult
without a model. There are two important requirements on the
desired model of cascade morphology. First, this model should
have enough expressivity and scalability to allow us to represent
and describe cascades with arbitrary structures, shapes,
and sizes. Real-world cascades sometimes have large sizes,
containing thousands of nodes and edges [22]. Second, this
model should allow us to quantitatively characterize and rigorously
analyze cascades based on the features extracted
from this model.

1.3 Limitations of Prior Art 

Despite the numerous publications regarding different aspects
of online social networks, little work has been done on
the morphology of cascades. Recently some researchers have
studied the structure of cascades [13]; however,
their analysis of cascade structures is limited to basic structural
properties such as degree distribution, size, and depth.
These structural properties of cascades are important; however,
they are far from being sufficient to precisely describe
and represent cascade morphology.

1.4 Proposed Model 

In this paper, we propose a Multi-order Markov Model 
for the Morphology of Cascades (M^4C) that can represent
and quantitatively characterize the morphology of cascades 
with arbitrary structures, shapes, and sizes. M^4C has two
key components: a cascade encoding algorithm and a cas- 
cade modeling method. The cascade encoding algorithm 
uniquely encodes the morphology of a cascade for quantita- 
tive representation. It encodes a cascade by first perform- 
ing a depth-first traversal on the cascade graph and then 
compressing the traversal results using run-length encoding. 
The cascade modeling method models the run-length en- 
coded sequence of a cascade as a discrete random process. 
This random process is further modeled as a Markov chain, 
which is then generalized into a multi-order Markov chain 
model. M^4C satisfies the aforementioned two requirements.
First, this model can precisely represent cascades with arbi- 
trary structures, shapes, and sizes. Second, this model al- 
lows us to quantitatively characterize cascades with different 
attributes using the state information from the underlying 
multi-order Markov chain model. 

1.5 Experimental Evaluation 

To demonstrate the effectiveness of our M^4C model in
quantitatively characterizing cascades, we use it to investigate
an unexplored but important problem in online social
networks, namely cascade size prediction: given the first $\tau_1$ edges in
a cascade, we want to predict whether the cascade will have
a total of at least $\tau_2$ ($\tau_2 > \tau_1$) edges over its lifetime. This
prediction has many real-world applications. For example,
media companies can use it to predict social media stories
that can potentially go viral [15] [29]. Furthermore, solving
this problem enables early detection of epidemic outbreaks
and political crises. Despite its importance, this problem has
not been addressed in prior literature.

We validate the effectiveness of the M^4C based cascade size
prediction scheme on a real-world data set collected from
Twitter containing more than 8 million tweets, involving
more than 200 thousand unique users. The results show
that our M^4C based cascade size prediction scheme consistently
achieves more than 90% classification accuracy under
different experimental scenarios. We also compare our M^4C
based cascade size prediction scheme with a baseline prediction
scheme based on cascade graph features such as edge
growth rate, degree distribution, clustering, and diameter.
The results show that M^4C allows us to achieve significantly
better classification accuracy than the baseline method.



1.6 Key Contributions 

In this paper, we not only propose the first cascade mor- 
phology model, but also propose the first cascade size pre- 
diction scheme based on our model. In summary, we make 
the following key contributions in this paper. 

1. We propose M^4C for representing and quantitatively
characterizing the morphology of cascades with arbitrary
structures, shapes, and sizes.

2. To demonstrate the effectiveness of our M^4C model in
quantitatively characterizing cascades, we develop a
cascade size prediction scheme based on M^4C features
and compare its performance with that based on
non-M^4C features.

The rest of this paper proceeds as follows. We first review
related work in Section 2. We then introduce our proposed
model in Section 3. We describe the details of our Twitter
data set in Section 4. We present the experimental results
of the aforementioned application in Section 5. Finally, we
conclude in Section 6 with an outlook to our future work.

2. RELATED WORK 

Cascades in online social networks have attracted much 
attention and investigation; however, little work has been 
done on cascade morphology. Below we summarize the prior 
work related to cascade morphology. 

2.1 Shape 

Zhou et al. studied Twitter posts (i.e., tweets) about the 
Iranian election [36]. In particular, they studied the fre- 
quency of pre-defined shapes in cascades. Their experimen- 
tal results showed that cascades tend to have more width 
than depth. The largest cascade observed in their data has 
a depth of seven hops. Leskovec et al. studied patterns in
the shapes and sizes of cascades in blog and recommendation
networks [24] [23]. Their work is also limited to studying
the frequency of fixed shapes in cascades.

2.2 Structure 

Kwak et al. investigated the audience size, tree height, and
temporal characteristics of the cascades in a Twitter data set
[22]. Their experimental results showed that the audience
size of a cascade is independent of the number of neighbors
of the source of that cascade. They found that about 96%
of the cascades in their data set have a height of 1 hop and
the height of the biggest cascade is 11 hops. They also found
that about 10% of cascades continue to expand even
one month after their start. Romero et al. specifically studied
Twitter cascades with respect to hashtags in terms of
degree distribution, clustering, and tie strengths [31]. The
results of their experiments showed that cascades from diverse
topics (identified using hashtags), such as sports, music,
technology, and politics, have different characteristics.
Similarly, Rodrigues et al. studied structure-related properties
of Twitter cascades containing URLs. They studied
cascade properties like height, width, and the number
of users for cascades containing URLs from different web
domains. Sadikov et al. investigated the estimation of the
sizes and depths of information cascades with missing data
[32]. Their estimation method uses multiple features including
the number of nodes, the number of edges, the number
of isolated nodes, the number of weakly connected components,
node degree, and non-leaf node out-degree. Their empirical
evaluation using a Twitter data set showed that their
method accurately estimates cascade properties for varying
fractions of missing data.

Figure 1: Toy example of cascade construction and encoding.

2.3 Simulation 

Gomez et al. studied the structure of discussion cascades
in Wikipedia, Slashdot, Barrapunto, and Meneame using
features solely based on the depth and degree distribution of
cascades [13]. They also developed a generative model based
on the maximum likelihood estimation of a preferential attachment
process to simulate synthetic discussion cascades.
However, their model does not capture morphological properties
of cascades and is limited to generation of synthetic
discussion cascades.

3. PROPOSED MODEL 

In this section, we present M^4C for quantitatively representing
the morphology of cascades in online social networks.
It consists of two major components. The first component 
encodes a given cascade graph for quantitative represen- 
tation such that its morphological information is retained. 
The second component models the encoded sequence using 
a multi-order Markov chain. Before we describe these two 
components, we first present the details of the cascade graph 
construction process. 

3.1 Cascade Graph Construction 

A social network can be represented using two graphs, a
relationship graph and a cascade graph. Both graphs share
the same set of nodes (or vertices) $V$, which represents the
set of all users in a social network. A relationship graph represents
the relationships among users in a social network.
In this graph, nodes represent users and edges represent the
relationships among users. If the edges are directed, where a
directed edge from user $u$ to user $v$ denotes that $v$ is a follower
of $u$, then this graph is called a follower graph, denoted
as $(V, E_f)$, where $V$ is the set of users and $E_f$ is the set of
directed edges. If the edges are undirected, where an undirected
edge between user $u$ and user $v$ denotes that $u$ and
$v$ are friends, then this graph is called a friendship graph,
denoted as $(V, E_f)$, where $V$ is the set of users and $E_f$ is the
set of undirected edges. By the nature of our study, we focus
on the follower graph denoted as $G_f = (V, E_f)$. The cascade
graph represents the dynamic activities that are taking place
in a social network (such as users sharing a URL or joining a
group). A cascade graph is an acyclic directed graph denoted
as $G_c = (V, E_c, T)$, where $V$ is the set of users, $E_c$ is a set
of directed edges where a directed edge $e = (u, v)$ from user
$u$ to user $v$ represents the propagation of something from
$u$ to $v$, and $T$ is a function whose input is an edge $e \in E_c$
and whose output is the time when the propagation along edge $e$
happens.

While the static relationship graph is easy to construct
from a social network, the dynamic cascade graph is nontrivial
to construct because there may be multiple propagation
paths from the cascade source to a node. So far there
is no consensus on cascade graph construction in prior literature.
In this paper, we use a construction method that is
similar to the method described in [32]. We next explain our
construction method through a Twitter example. Consider
the follower graph in Figure 1(a). Let $(u, t)$ denote a user $u$
performing an action, such as posting a URL on $u$'s Twitter
profile, at time $t$. Suppose the following actions happen
in increasing time order: $(A, t_1)$, $(B, t_2)$, $(D, t_3)$, $(C, t_4)$,
$(E, t_5)$, where $t_1 < t_2 < t_3 < t_4 < t_5$. Suppose $(A, t_1)$ denotes
that A posts a URL on his Twitter profile, and all
other actions (namely $(B, t_2)$, $(D, t_3)$, $(C, t_4)$, and $(E, t_5)$)
are reposting the same URL from A.

The cascade graph regarding the propagation of this URL
is constructed as follows. First, A is the root of the cascade
graph because it is the origin of this cascade. Second, B
reposting A's tweet (which is a URL in this example) at
time $t_2$ must be under A's influence because there is only
one path from A to B in the follower graph in Figure 1(a).
Therefore, in the cascade graph in Figure 1(b), there is an
edge from A to B with time stamp $t_2$. Note that each repost
(or retweet in Twitter's terminology) contains the origin of
the tweet (A in this example). Third, however, D reposting
A's tweet at time $t_3$ could be under either A's influence
(because there is a path from A to D in the follower graph in
Figure 1(a) and $t_1 < t_3$) or B's influence (because there is a
path from B to D in the follower graph as well and $t_2 < t_3$).
Note that even if D sees A's tweet through B's retweet,
the repost of A's tweet on D's profile does not contain any
information about B and only shows that the origin of the
tweet is A. In this scenario, we assume that D is partially
influenced by both A and B, instead of assuming that D is
influenced by either user B or A, because this way we can
retain more information with respect to the corresponding
follower graph. Therefore, there is an edge from A to D and
another edge from B to D in the cascade graph shown in
Figure 1(b), where the time stamps of both edges are $t_3$.
Similarly, we add the edge from B to C with a time stamp
$t_4$ and the edge from D to E with a time stamp $t_5$ in the
cascade graph.
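The construction rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function name `build_cascade_graph`, the dictionary representation of the follower graph, and the integer time stamps are all hypothetical. The follower edges listed are only the ones mentioned in the text, so Figure 1(a) may contain more. The sketch also assumes that a user can be influenced only by earlier adopters whom that user follows directly.

```python
def build_cascade_graph(followers, actions):
    """Return cascade edges as (source, target, time) tuples.

    followers: dict mapping a user u to the set of users who follow u.
    actions:   list of (user, time) pairs; the earliest one is the cascade root.
    """
    actions = sorted(actions, key=lambda a: a[1])
    adopted = []  # users who have already posted, in time order
    edges = []
    for user, t in actions:
        for earlier in adopted:
            # Assumption: every earlier adopter that `user` follows directly
            # is treated as a (partial) source of influence.
            if user in followers.get(earlier, set()):
                edges.append((earlier, user, t))
        adopted.append(user)
    return edges


# Follower edges mentioned in the text (u -> set of u's followers).
followers = {"A": {"B", "D"}, "B": {"C", "D"}, "D": {"E"}}
# Integer stand-ins for the time stamps t_1 < t_2 < t_3 < t_4 < t_5.
actions = [("A", 1), ("B", 2), ("D", 3), ("C", 4), ("E", 5)]
edges = build_cascade_graph(followers, actions)
print(edges)
# [('A', 'B', 2), ('A', 'D', 3), ('B', 'D', 3), ('B', 'C', 4), ('D', 'E', 5)]
```

Note that D receives two incoming edges with the same time stamp, which mirrors the partial-influence assumption made in the text.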

3.2 Cascade Encoding 

The first step in cascade encoding is to encode the con- 
structed cascade graph as a binary sequence that uniquely 
represents the structure of the cascade graph. Graph encod- 
ing has been studied for a wide range of problems across 
several domains such as image compression, text and speech 
recognition, and DNA profiling [28] [?] [16]. The typical goal
of graph encoding is to transform large geometric data into a 
succinct representation for efficient storage and processing. 
However, our goal here is to encode a given cascade graph 
in a way that its morphological information is captured. To- 
wards this end, we use the following graph encoding algo- 
rithm. 

We first conduct a depth-first traversal of the constructed
cascade graph starting from the root node, which results
in a spanning tree. To obtain a unique spanning tree,
at each node in the cascade graph, we sort the outgoing
edges in increasing order of their time stamps, i.e., we
sort the outgoing edges $e_1, e_2, \ldots, e_k$ of a node so that
$T(e_1) < T(e_2) < \cdots < T(e_k)$, and then traverse them in this
order. For each edge, we use 1 to encode its downward traversal
and 0 to encode its upward traversal. Figure 1(c) shows
the traversal of the cascade graph in Figure 1(b) and the
encoding of each downward or upward traversal. The binary
encoding resulting from this traversal process is 11011000. Let
$C$ represent the binary code of a cascade graph $G = (V, E)$.
Then the length of the binary code $|C|$ is always twice the
size of the edge set $E$, i.e., $|C| = 2|E|$. Furthermore, let
$C[i]$ be the $i$-th element of the binary code and $I(C[i])$ be
an indicator function so that $I(C[i]) = 1$ if $C[i] = 1$, and
$I(C[i]) = -1$ if $C[i] = 0$. Because each edge is traversed
exactly twice, once downward and once upward, we have:

$$\sum_{i=1}^{|C|} I(C[i]) = 0.$$

The second step in cascade encoding is to convert the binary
sequence, which is obtained from the depth-first traversal
of the cascade graph, into the corresponding run-length
encoding. A run in a binary sequence is a subsequence where
all bits in this subsequence are 0s (or 1s) but the bits before
and after the subsequence are 1s (or 0s), if they exist. By
replacing each run in a binary sequence with the length of
the run, we obtain the run-length encoding of the binary sequence
[19]. For example, for the binary sequence 11011000,
the corresponding run-length encoding is 2123. Since the binary
sequence obtained from our depth-first traversal of a
cascade graph always starts with 1, the run-length encoding
uniquely and compactly represents the binary sequence.
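The two encoding steps can be sketched as follows. This is an illustrative sketch rather than the paper's code: the function names and the edge-list format are assumptions, and the example input is a hypothetical four-edge tree chosen so that its binary code equals the string 11011000 used in the text (it is not claimed to be the graph of Figure 1). Already visited nodes are skipped, so only spanning tree edges contribute bits.

```python
from itertools import groupby


def dfs_binary_code(edges, root):
    """Encode a cascade as a bit string: '1' per downward step, '0' per upward step.

    edges: iterable of (source, target, time) tuples.
    """
    children = {}
    for src, dst, t in edges:
        children.setdefault(src, []).append((t, dst))
    visited = {root}
    bits = []

    def visit(node):
        # Outgoing edges are traversed in increasing order of time stamp.
        for _, child in sorted(children.get(node, [])):
            if child in visited:
                continue  # non-tree edge: the node was reached earlier
            visited.add(child)
            bits.append("1")
            visit(child)
            bits.append("0")

    visit(root)
    return "".join(bits)


def run_length_encode(bits):
    """Replace each run of identical bits with its length."""
    return [len(list(run)) for _, run in groupby(bits)]


# Hypothetical tree: A -> B, B -> C, B -> D, D -> E (times 1..4).
tree_edges = [("A", "B", 1), ("B", "C", 2), ("B", "D", 3), ("D", "E", 4)]
code = dfs_binary_code(tree_edges, "A")
rle = run_length_encode(code)
print(code, rle)  # 11011000 [2, 1, 2, 3]
```

The recursive traversal is adequate for small examples; for cascades with very deep trees an explicit stack would avoid Python's recursion limit.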



3.3 Markov Chain Model of Cascades 

We want to model cascade encoding to capture charac- 
teristics of cascades so that they can be used to identify 
the similarities and differences among cascades. This model 
should allow us to extract morphological features for differ- 
ent classes of cascades and then use these features to classify 
them. We first present our model, and then demonstrate its 
usefulness in classifying cascades. 

Consider the run-length encoded sequence $C$ of a cascade
graph $G$. We can model this sequence using a discrete random
process $\{C_k\}$, $k = 1, 2, \ldots, |C|$. Basic analysis of this process
reveals that there is some level of dependency among
the consecutive symbols emitted by the random process. In
other words, it would be unreasonable to assume that the
process is independent or memoryless. Meanwhile, to balance
between capturing some of the dependencies within
the process and simplifying the mathematical treatment of
this encoded sequence, we invoke the Markovian
assumption [?]. As we show later, this assumption can be
reasonably justified (to some extent) by analyzing the autocorrelation
function of the underlying process $\{C_k\}$. For
a first order Markov process, this implies the following assumption:

$$\Pr[C_n = c_n \mid C_1 = c_1, C_2 = c_2, \ldots, C_{n-1} = c_{n-1}] = \Pr[C_n = c_n \mid C_{n-1} = c_{n-1}].$$

Equivalently:

$$\Pr[c_1, c_2, \ldots, c_n] = \Pr[c_1] \Pr[c_2 \mid c_1] \cdots \Pr[c_n \mid c_{n-1}]. \qquad (1)$$

In other words, we invoke the Markovian assumption about
the underlying cascade process and its morphology, which is
represented by the encoded sequence $C$.

Given the Markovian assumption with homogeneous time-invariant
transition probabilities, $C$ can be represented using
a traditional Markov chain. Figure 2 shows the Markov chain
corresponding to the toy example in Figure 1, where each
unique symbol in $C$ is represented as a state. The Markov
chain in Figure 2 has 3 states because there are 3 unique
symbols in its run-length encoding.




Figure 2: Markov chain model for the toy example. 

A Markov chain can also be specified in terms of its state
transition probabilities, denoted as $T$. Hence, for the toy
example of Figure 2, $T$ is a $3 \times 3$ matrix whose entries
$P_{i|j}$ represent the conditional probabilities $\Pr[C_n = i \mid C_{n-1} = j]$.
The Markov chain framework allows us to quantify
the probability of an arbitrary sequence of states by using
Equation (1). This will help us to identify sequences that
are more (or less) probable in one class of cascades. We next
further generalize the above basic Markov chain model.
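As a rough illustration of how Equation (1) can be evaluated, the sketch below estimates first order transition probabilities by relative frequency and multiplies them along a sequence. The function names, the dictionary layout, and the choice of relative-frequency estimation are assumptions made for this example; the single training sequence is the run-length encoding 2123 from the text.

```python
from collections import Counter, defaultdict


def fit_first_order(sequences):
    """Estimate Pr[first symbol] and Pr[next | previous] by relative frequency."""
    start = Counter(seq[0] for seq in sequences if seq)
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[prev][nxt] += 1
    n_start = sum(start.values())
    initial = {s: c / n_start for s, c in start.items()}
    transition = {
        prev: {nxt: c / sum(counts.values()) for nxt, c in counts.items()}
        for prev, counts in pair_counts.items()
    }
    return initial, transition


def sequence_probability(seq, initial, transition):
    """Probability of seq under Equation (1); unseen transitions count as 0."""
    prob = initial.get(seq[0], 0.0)
    for prev, nxt in zip(seq, seq[1:]):
        prob *= transition.get(prev, {}).get(nxt, 0.0)
    return prob


initial, transition = fit_first_order([[2, 1, 2, 3]])
p = sequence_probability([2, 1, 2, 3], initial, transition)
print(transition)  # {2: {1: 0.5, 3: 0.5}, 1: {2: 1.0}}
print(p)           # 0.25
```

With a single training sequence the estimates are of course degenerate; in practice the counts would be pooled over many cascades of the same class.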

3.4 Multi-order Generalization 

Each element of the state transition matrix of a Markov 
chain is equivalent to a sub-sequence of C, which in turn is 
equivalent to a subgraph of the corresponding cascade. We 
can generalize a Markov chain model by incorporating multi- 
ple consecutive transitions as a single state in the state tran- 
sition matrix, which will allow us to specify arbitrary sized 
subgraphs of cascades. Such generalized Markov chains are 
called multi-order Markov chains and are sometimes referred 
to as full-state Markov chains [21]. The order of a Markov
chain represents the extent to which past states determine 
the present state. The basic Markov chain model introduced 
earlier is of order 1. 

Autocorrelation is an important statistic for selecting an
appropriate order for a Markov chain model [5]. For a given
lag $t$, the autocorrelation function of a stochastic process
$X_m$ (where $m$ is the time or space index) is defined as:

$$\rho[t] = \frac{E\{X_0 X_t\} - E\{X_0\} E\{X_t\}}{\sigma_{X_0} \sigma_{X_t}}, \qquad (2)$$

where $E\{\cdot\}$ represents the expectation operation and $\sigma_{X_t}$ is
the standard deviation of the random variable at time or
space lag $t$. The value of the autocorrelation function lies
in the range $[-1, 1]$, where $|\rho[t]| = 1$ indicates perfect correlation
at lag $t$ and $\rho[t] = 0$ means no correlation at lag
$t$. Figure 3 plots the sample autocorrelation function of the
run-length encoding of an example cascade. The dashed horizontal
lines represent the 95% confidence envelope. For this
particular example, we observe that sample autocorrelation
values jump outside the confidence envelope at lag = 3. This
indicates that the underlying random process has a third
order dependency. Thus, we select the third order for the Markov
chain model for this particular cascade. The autocorrelation-based
analysis of more complex cascades can lead to even
higher order Markov chains.

Figure 3: Sample autocorrelation function for the
toy example.
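A simple way to experiment with this order selection step is sketched below. Two points are assumptions rather than details taken from the paper: the 95% confidence envelope is approximated by the common large-sample bound $\pm 1.96/\sqrt{N}$, and the order is taken to be the largest lag whose sample autocorrelation falls outside that envelope. The function names and the example sequence are hypothetical.

```python
import math


def sample_autocorrelation(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    if var == 0:
        raise ValueError("autocorrelation is undefined for a constant sequence")
    return [
        sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / var
        for lag in range(max_lag + 1)
    ]


def select_order(x, max_lag):
    """Largest lag whose autocorrelation leaves the assumed 95% envelope."""
    bound = 1.96 / math.sqrt(len(x))  # assumed confidence envelope
    rho = sample_autocorrelation(x, max_lag)
    significant = [lag for lag in range(1, max_lag + 1) if abs(rho[lag]) > bound]
    return max(significant) if significant else 0


x = [1, 2, 1, 2, 1, 2, 1, 2]  # hypothetical run-length encoded sequence
rho = sample_autocorrelation(x, 4)
order = select_order(x, 4)
```

For this alternating sequence the autocorrelation at lags 1 and 2 exceeds the bound while lags 3 and 4 do not, so the sketch selects order 2.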

The number of possible states of a Markov chain increases
exponentially with the order of the Markov
chain model. For the $n$-th order extension of a Markov chain
with $k$ states, the total number of states is $k^n$. Figure 4 shows
the second order extension of the 3-state, first
order Markov chain model shown in Figure 2. This second
order Markov chain contains a total of $3^2 = 9$ states, 4
of which are shown in the figure due to space limitations.
In this second order Markov chain model, the conditional
probabilities are in the form $P_{i,j|k,l}$ and the state transition
matrix is a $9 \times 9$ matrix with these probabilities as its entries.

Figure 4: Multi-order generalization of the Markov
chain model for the toy example.

For a set of cascade encoding sequences, let $T$ denote the
set of selected orders as per the aforementioned criterion.
We select the maximum value in $T$, denoted by $T_{max}$, as
the order of a single Markov chain model that we want to
employ.
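The state-space growth and the choice of $T_{max}$ can be illustrated with a short sketch. The helper names and the list of per-cascade orders are hypothetical; the sketch only enumerates states as tuples of consecutive symbols and does not estimate transition probabilities.

```python
from collections import Counter
from itertools import product


def all_states(symbols, order):
    """All tuples of `order` consecutive symbols: k**order states for k symbols."""
    return list(product(sorted(symbols), repeat=order))


def observed_states(seq, order):
    """Count the windows of length `order` that actually occur in seq."""
    return Counter(tuple(seq[i:i + order]) for i in range(len(seq) - order + 1))


second_order = all_states({1, 2, 3}, 2)
seen = observed_states([2, 1, 2, 3], 2)
print(len(second_order))  # 9
print(len(seen))          # 3

# Hypothetical orders selected for individual cascades; T_max is their maximum.
selected_orders = [1, 3, 2]
t_max = max(selected_orders)
```

Only 3 of the 9 possible second order states occur in the toy sequence, which hints at the sparsity issue discussed in the feature selection step.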

3.5 Cascade Classification 

As mentioned in Section 1.2, an important desirable property
of our proposed model is its ability to identify differentiating
features of cascade morphology that can potentially be leveraged
for automated classification of cascades. We now show
how to use the aforementioned Markov chain model to classify
cascades.

3.5.1 Feature Selection 

The essence of our modeling approach is to capture the 
morphology of a cascade through the states of the multi- 
order Markov model. Each state in the Markov chain repre- 
sents a likely sub-structure of cascades' morphology. Thus, 
we can use these states to serve as underlying features that 
can be used to characterize a given cascade and to determine 
the class that it might belong to. However, as mentioned
earlier, the number of states in a Markov chain increases
exponentially for higher orders and so does the complexity
of the underlying model. Furthermore, higher order Markov
chains require a large amount of training data to identify a
subset of states that actually appear in the training data.
In other words, a Markov chain model trained with limited
data is typically sparse. Therefore, we use the following two
approaches to systematically reduce the number of states in
the Markov chain of order $T_{max}$.

First, we can combine multiple states in the Markov chain
to reduce its number of states. By combining states in a
multi-order Markov chain, we are essentially using states
from lower order Markov chains. We need to establish a
criterion to combine states in the Markov chain. Towards
this end, we use the concept of typicality of Markov chain
states. Typicality allows us to identify a typical subset of
Markov chain states by generating its realizations. Before
delving into further details, we first state the well-known
typicality theorem below: For any stationary and
irreducible Markov process $X$ and a constant $c$, the sequence
$x_1, x_2, \ldots, x_m$ is almost surely $(n, \epsilon)$-typical for every
$n < c \log m$ as $m \to \infty$. A sequence $x_1, x_2, \ldots, x_m$ is called
$(n, \epsilon)$-typical for a Markov process $X$ if $\hat{P}(x_1, x_2, \ldots, x_n) = 0$
whenever $P(x_1, x_2, \ldots, x_n) = 0$, and

$$\left| \frac{\hat{P}(x_1, x_2, \ldots, x_n)}{P(x_1, x_2, \ldots, x_n)} - 1 \right| < \epsilon, \quad \text{when } P(x_1, x_2, \ldots, x_n) > 0.$$

Here $\hat{P}(x_1, x_2, \ldots, x_n)$ and $P(x_1, x_2, \ldots, x_n)$ are the empirical
relative frequency and the actual probability of the sequence
$x_1, x_2, \ldots, x_n$, respectively. In other words,

$$\hat{P}(x_1, x_2, \ldots, x_n) \approx P(x_1, x_2, \ldots, x_n).$$

This theorem shows us a way of empirically identifying typical
sample paths of arbitrary length for a given Markov
process. Based on this theorem, we generate realizations (or
sample paths) of arbitrary lengths from the transition matrix
of the Markov process. By generating a sufficiently large
number of sample paths of a given length, we can identify
a relatively small subset of sample paths that are typical.
Using this criterion, we select a subset of up to top-100,000
typical states as potential features, whose lengths vary in
the range $[0, T_{max}]$. In what follows, we further short-list
the Markov states from the top-100,000 typical subset and
use them as features to classify cascades.
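The sampling procedure can be approximated with the sketch below, which draws sample paths from a first order chain and ranks them by empirical relative frequency. It is an assumption-laden illustration, not the procedure used in this paper: the function names, the dictionary-based transition table, the fixed random seed, and the rule of ranking by raw frequency are all choices made for the example. The chain used is the one estimated from the toy encoding 2123, in which state 3 has no outgoing transition, so paths that reach it stop early.

```python
import random
from collections import Counter


def sample_path(initial, transition, length, rng):
    """Draw one realization; it stops early at a state with no outgoing transition."""
    states, weights = zip(*initial.items())
    path = [rng.choices(states, weights=weights)[0]]
    while len(path) < length and path[-1] in transition:
        nxt, w = zip(*transition[path[-1]].items())
        path.append(rng.choices(nxt, weights=w)[0])
    return tuple(path)


def typical_paths(initial, transition, length, n_paths, top_k, seed=0):
    """Return the top_k most frequent sample paths with their relative frequencies."""
    rng = random.Random(seed)
    counts = Counter(
        sample_path(initial, transition, length, rng) for _ in range(n_paths)
    )
    return [(path, c / n_paths) for path, c in counts.most_common(top_k)]


initial = {2: 1.0}
transition = {2: {1: 0.5, 3: 0.5}, 1: {2: 1.0}}
top = typical_paths(initial, transition, length=3, n_paths=10_000, top_k=2)
```

With enough sample paths, the empirical frequencies of the two possible paths, (2, 1, 2) and (2, 3), both approach their actual probability of 0.5, which is the behavior the typicality theorem describes.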

Second, to further reduce the number of features to be 
employed in a classifier, we need to prioritize the aforemen- 
tioned typical Markov states. The prioritization of features 
can be based on their differentiation power. An information 
theoretic measure that can be used to quantify the differ- 
entiation power of features (Markov states in our case) is 
information gain [?]. In this context, information gain is the
mutual information between a given feature $X_i$ and the class
variable $Y$. For a given feature $X_i$ and the class variable $Y$,
the information gain of $X_i$ with respect to $Y$ is defined as:

$$IG(X_i; Y) = H(Y) - H(Y \mid X_i),$$

where $H(Y)$ denotes the marginal entropy of the class variable
$Y$ and $H(Y \mid X_i)$ represents the conditional entropy of
$Y$ given feature $X_i$. In other words, information gain quantifies
the reduction in the uncertainty of the class variable
$Y$ given that we have complete knowledge of the feature $X_i$.
Note that, in this paper, the class variable $Y$ takes values in $\{0, 1\}$ because
we apply our morphology modeling framework to problems
that require differentiating between two classes of cascades
(as described later). In this study, we eventually only select
the top-100 features with the highest information gain.
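The information gain ranking can be sketched as follows for discrete features. The function names and the tiny data set are hypothetical, and entropies are measured in bits (base-2 logarithm), which is an assumption since the text does not fix the base.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """IG(X; Y) = H(Y) - H(Y | X) for one discrete feature column."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def top_features(columns, labels, k):
    """Indices of the k feature columns with the highest information gain."""
    gains = [information_gain(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda i: gains[i], reverse=True)[:k]


labels = [1, 1, 0, 0]
columns = [
    [1, 0, 1, 0],  # uninformative: present equally often in both classes
    [1, 1, 0, 0],  # perfectly informative: present only in class 1
]
best = top_features(columns, labels, k=1)
```

Here the second column has an information gain of 1 bit and the first has 0, so `best` contains only index 1.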



3.5.2 Classification 

Let us assume that the presence of a state $i$ is represented
by a binary random variable $X_i$, $i = 1, 2, \ldots, 100$. Hence,
$P(X_i = 1)$ represents the probability for the presence of
state $i$. We can think of the $X_i$'s as the variables representing
potential features. Thus, our training process proceeds
as follows. For a given class $Y$ of cascades, we evaluate
the presence of a given feature (state) $X_i$ in $Y$ by analyzing
a sufficiently large number of sample cascades that
belong to the class $Y$. Subsequently, we are able to evaluate
the a-priori conditional probability $P(X_i \mid Y)$ for each class
$Y \in \{1, 2, \ldots, k\}$, where the number of classes $k$ is usually
very small. In our case, we are interested in the traditional
binary classifier with $k = 2$. However, note that this classification
methodology can be extended to the cases with $k > 2$
using the well-known one-against-one (pairwise) or multiple
one-against-all formulations [17].

We can jointly use multiple features to differentiate between
two sets of cascades belonging to different classes. In
particular, given the top-100 features with respect to information
gain, we can classify cascades by deploying a machine
learning classifier. In this study, we use a naive Bayes
classifier to jointly utilize the selected features to classify cascades.
Naive Bayes is a popular probabilistic classifier that
has been widely used in the text mining and bioinformatics
literature, and has been reported to outperform more complex
techniques in terms of classification accuracy [35]. It is trained using
two sets of probabilities: the prior, which represents the
marginal probability $P(Y)$ of the class variable $Y$; and the
a-priori conditional probabilities $P(X_i \mid Y)$ of the features $X_i$
given the class variable $Y$. As previously explained, these
probabilities can be computed from the training set.

Now, for a given test instance of a cascade with observed
features $X_i$, $i = 1, 2, \ldots, n$, the a-posteriori probability
$P(Y \mid X^{(n)})$ can be computed for both classes $Y \in \{0, 1\}$,
where $X^{(n)} = (X_1, X_2, \ldots, X_n)$ is the vector of observed
features in the test cascade under consideration:

$$P(Y \mid X^{(n)}) = \frac{P(X^{(n)}, Y)}{P(X^{(n)})} = \frac{P(X^{(n)} \mid Y) P(Y)}{P(X^{(n)})}. \qquad (3)$$

The naive Bayes classifier then combines the a-priori conditional
probabilities by assuming conditional independence (hence
the "naive" term) among the features:

$$P(X^{(n)} \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y). \qquad (4)$$



Although the independence assumption among features
makes it feasible to evaluate the a-posteriori probabilities
with much lower complexity, it is unlikely that this assumption
truly holds all the time. For our study, we mitigate the
effect of the independence assumption by pre-processing the
features using the well-known Karhunen-Loeve Transform
(KLT) to decorrelate them [9].
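Equations (3) and (4) can be turned into a small classifier for binary presence features, as sketched below. This is a simplified illustration and not the classifier configuration used in our experiments: the function names are hypothetical, Laplace smoothing (the `alpha` parameter) is an added assumption to avoid zero probabilities, and the KLT pre-processing step is omitted. Since the denominator of Equation (3) is the same for every class, the sketch compares only the numerators, in log space.

```python
import math


def train_naive_bayes(X, y, alpha=1.0):
    """Estimate P(Y) and P(X_i = 1 | Y) from binary feature vectors X and labels y."""
    classes = sorted(set(y))
    n_features = len(X[0])
    prior, cond = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        prior[c] = len(rows) / len(X)
        # Laplace smoothing (an assumption of this sketch) keeps estimates in (0, 1).
        cond[c] = [
            (sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(n_features)
        ]
    return prior, cond


def predict(x, prior, cond):
    """Return the class maximizing log P(Y) + sum_i log P(X_i | Y)."""
    scores = {}
    for c in prior:
        log_p = math.log(prior[c])
        for xi, p in zip(x, cond[c]):
            log_p += math.log(p if xi else 1.0 - p)
        scores[c] = log_p
    return max(scores, key=scores.get)


# Hypothetical training data: two binary features, two classes.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
prior, cond = train_naive_bayes(X, y)
print(predict([1, 0], prior, cond))  # 1
print(predict([0, 1], prior, cond))  # 0
```

In this toy data only the first feature separates the classes, so the prediction follows its value.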

In the following section, we provide details of the data set
that we have collected to demonstrate the usefulness of our
M^4C model.

Figure 5: Typical examples of real-world Twitter cascades.



4. DATA SET 

4.1 Data Collection 

Among the popular online social networks, Twitter is one 
of the social networks that allows systematic collection of 
public data from its site. Therefore, we chose to study the 
morphology of cascades appearing on Twitter. To collect 
data from Twitter, we focused on tweets related to the Arab 
Spring event, which represents an ideal case study because 
it spans several months. For countries involved in the Arab 
Spring event, we collected data from Twitter during one 
complete week in March 2011. We provide more details of 
the data collection process in the following text. 

For our study, we separately collected two data sets from 
Twitter. The first data set was collected using Twitter's 
streaming API, which allows the realtime collection of public 
tweets matching one or more filter predicates [2]. To collect 
tweet data pertaining to a given country, we provided rel- 
evant keywords as filter predicates. For example, we used 
the keywords 'Libya' and 'Tripoli' to collect tweets related 
to Libya. In total, we collected tweets for 8 countries over a 
period of a week in March 2011. Using Twitter's streaming 
API, we collected more than 8 million tweets involving more 
than 200 thousand unique users. 

As mentioned in Section 3.1, we cannot accurately construct
cascade graphs without information about whom the
users are following. The one-way following policy of Twit- 
ter results in three types of relationships between two given 
users: (1) both follow each other, (2) only one of them fol- 
lows the other, and (3) they do not follow each other. Twitter 
provides follower information for a given user via a separate 
interface called the REST API [1]. The REST API employs aggressive
rate limiting by allowing clients to make only a limited
number of API calls in an hour. Twitter applies this limit
based on the public IP address or authentication token of
the client that issues the request. Currently, rate limiting
for the REST API permits only 150 requests per hour for
unauthenticated users and 350 requests per hour for
authenticated users. In our tweet data set, we encountered more
than 200,000 unique users and we were required to make at
least one request per user to get the follower list. For each
user who follows more than 5000 users, we had to make
a separate call to get each subset of 5000 users. It is
noteworthy that some users were following or were being
followed by millions of users, requiring thousands of separate
calls for each user. It would have taken us several months to
collect this data using a single authentication token or
a single external IP address. To overcome this limitation,
we utilized dozens of public proxy servers to parallelize calls
to Twitter's REST API [3t|. Using this methodology, we
collected follower lists of all users in less than a month.
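
The effect of the rate limit can be illustrated with a short back-of-the-envelope calculation. The page size of 5000 and the limit of 350 requests per hour come from the text above; the function names and the assumption that parallel clients scale throughput linearly are ours.

```python
import math

PAGE_SIZE = 5000           # users returned per REST call (from the text)
AUTH_LIMIT_PER_HOUR = 350  # authenticated rate limit (from the text)

def calls_for_user(list_size, page_size=PAGE_SIZE):
    """Number of REST calls needed to page through one user's list."""
    return max(1, math.ceil(list_size / page_size))

def hours_needed(total_calls, limit_per_hour=AUTH_LIMIT_PER_HOUR, clients=1):
    """Wall-clock hours, assuming `clients` tokens or IPs work in parallel."""
    return total_calls / (limit_per_hour * clients)

# Lower bound: exactly one call for each of 200,000 users, single token.
lower_bound_days = hours_needed(200_000) / 24
```

Even this lower bound is roughly 24 days with a single token; the additional paged calls for heavily connected users push the total well beyond that, which is why parallel clients were needed.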

Twitter provides a "re-tweet" functionality which allows
users to re-post the tweets of other users to their profiles.
The reference to the user with the original tweet is maintained
in all subsequent re-tweets, but there is no information on
intermediate users in re-tweets. Using the follower graph, we
constructed cascade graphs for all sets of re-tweets, which
are essentially cascades. Therefore, the overall graph is a
union of all cascades in our data. In Figure 5, we visualize
two cascades in our data set using the radial and circular layout
methods in Graphviz. In a radial layout, we choose the
user with the original tweet as the center vertex (or root vertex
in general) and the remaining vertices are placed in concentric
circles based on their proximity to the center vertex.
In a circular layout, all components are plotted separately
with their respective vertices in a circular format. Visualization
of the two example cascades provides interesting insights
about their morphology. From the first example, we observe
that the degree of vertices typically decreases as their distance
from the root vertex increases. However, for the second
example, we observe that subsequent vertices have degrees
comparable to the root vertex. In this paper, our aim is to
capture such differences in an automated fashion using our
proposed model.
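
One plausible way to combine re-tweets with the follower graph is sketched below. The exact construction rule is given in Section 3.1 and is not reproduced here, so this sketch assumes that an edge from u to v is drawn whenever participant v follows an earlier participant u. The function name and input layout are hypothetical.

```python
def build_cascade_edges(participants, follows):
    """Infer cascade edges from a time-ordered participant list.

    participants: user ids ordered by tweet time, original author first.
    follows: dict mapping a user to the set of users he or she follows.
    Returns directed edges (u, v), meaning v may have seen the tweet via u.
    Assumption: v is linked to every earlier participant that v follows.
    """
    edges = []
    for position, v in enumerate(participants):
        for u in participants[:position]:
            if u in follows.get(v, set()):
                edges.append((u, v))
    return edges

# Toy example: b follows a; c follows both a and b.
example_edges = build_cascade_edges(
    ["a", "b", "c"],
    {"b": {"a"}, "c": {"a", "b"}},
)
```

Under this assumption a user can have several incoming edges, so the result is a general directed graph, not necessarily a tree.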

4.2 Data Analysis 

We now analyze the structural features of the cascades 
in our collected data set in terms of degree, path, and connectivity.
Later, in Section 5, we will use these features for
baseline comparison with our proposed model in terms of 
classification accuracy. For structural features that can only 
be computed from undirected graphs, such as clustering co- 
efficient and diameter, we compute them on the undirected 
versions of cascade graphs. 

4.2.1 Degree Properties 

We first jointly study the number of edges and the number 
of nodes for all cascades in our data set. The cascade graphs 
in our data set are connected and each user in the cascade 
graph has at least one inward or outward edge. Therefore, 
the number of edges in a cascade graph $|E|$ has the lower
bound $|E| \geq |V| - 1$, where $|V|$ is the number of users
participating in the cascade. Figure 6(a) shows the scatter plot
between edge and node counts for all cascades in our data
set. Note that we use a logarithmic scale for both axes.
From this figure, we observe that the scatter plot takes the
form of a strip whose thickness represents the average number
of additional edges for each node. The average thickness
of this strip approximately corresponds to having twice as
many edges as nodes.

Figure 6: Distributions of various cascade graph attributes in the Twitter data set.

4.2.2 Path Properties 

Another important characteristic of a cascade is the degree
of the root node (the user who initiated the cascade), which
typically has the highest degree among all nodes in
a cascade graph. In our data set, the root node has the
highest degree in more than 92% of the cascades. The degree
of the root node essentially represents the number of different
routes through which a cascade propagates in an online social
network. Note that these paths may merge together after the
first hop; however, we expect some correlation between the
degree of the root node and the number of unique routes through
which a cascade propagates. One relevant characteristic of a graph
is the average (shortest) path length (APL), which denotes the
average of all-pair shortest paths [ij:

$$\mathrm{APL} = \frac{1}{n(n-1)} \sum_{i \neq j} d(i,j),$$

where $n$ is the number of users in the cascade and $d(i,j)$ is
the shortest path length between users $i$ and $j$. We expect the
average path length of a cascade to be proportional to the degree
of the root node. Figure 6(b) shows the scatter plot of the root
node degree and the average path length. As expected, we observe
that cascades with higher root node degrees tend to have larger
average path lengths. The x-axis uses a logarithmic scale to
emphasize this relationship.

Another fundamental characteristic of a graph is its
diameter, which denotes the largest value of all-pair shortest
paths [4]. Figure 6(c) shows the distribution of the diameter of
cascades in our data set. The bars represent the probability
mass function and the line represents the cumulative distribution
function (CDF). The minimum diameter is 1 because the
minimum number of nodes in a cascade is 2. Cascades with
more than 2 nodes can have a diameter of 1 only if they are
cliques. In our data set, approximately 40% of cascades have a
diameter of 1. The largest cascades in our data set have a
diameter of 9.
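
Both path measures can be obtained from breadth-first searches on the undirected version of a cascade graph. The sketch below assumes a connected graph with at least two nodes, stored as a dictionary of neighbor sets, and averages over ordered pairs of distinct nodes; the function names are illustrative.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def apl_and_diameter(adj):
    """Average shortest path length and diameter of a connected graph.

    adj: dict mapping each node to the set of its neighbors (undirected).
    """
    n = len(adj)
    total, diameter = 0, 0
    for source in adj:
        dist = bfs_distances(adj, source)
        total += sum(dist.values())
        diameter = max(diameter, max(dist.values()))
    return total / (n * (n - 1)), diameter

path_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
```

For the three-node path the diameter is 2, whereas the triangle, being a clique, has a diameter of 1, matching the observation above.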

Finally, we can characterize the number of unique paths
that connect nodes in a graph by using the notion of spanning
trees. For a given graph, the number of unique paths
between nodes is proportional to the number of spanning
trees. The number of spanning trees of a graph $G$, denoted
by $t(G)$, is given by the product of the non-zero eigenvalues of
the Laplacian matrix and the reciprocal of the number of
nodes [i]:

$$t(G) = \frac{1}{n} \lambda_1 \lambda_2 \cdots \lambda_{n-1},$$

where $n$ is the number of nodes of the graph and $\lambda_i$ is the
$i$-th non-zero eigenvalue of the Laplacian matrix of the graph
($\lambda_i \neq 0, \forall i$). Figure 6(d) shows the CDF of the number
of spanning trees for cascades in our data set. Note that the
x-axis uses a logarithmic scale. We observe that only
a small fraction (< 15%) of cascades have more than one
spanning tree in our data set, which highlights their sparsity.
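
The eigenvalue formula for the number of spanning trees can be checked numerically. The following sketch assumes a connected, simple, undirected graph whose nodes are labeled 0 to n-1; the function name is illustrative.

```python
import numpy as np

def spanning_tree_count(n, edges):
    """Count spanning trees via the non-zero Laplacian eigenvalues.

    n: number of nodes, labeled 0..n-1.
    edges: iterable of undirected edges (u, v), each listed once.
    Assumes the graph is connected, so exactly one eigenvalue is zero.
    """
    laplacian = np.zeros((n, n))
    for u, v in edges:
        laplacian[u, u] += 1
        laplacian[v, v] += 1
        laplacian[u, v] -= 1
        laplacian[v, u] -= 1
    eigenvalues = np.linalg.eigvalsh(laplacian)  # ascending; first is ~0
    return int(round(np.prod(eigenvalues[1:]) / n))

tree_like = spanning_tree_count(3, [(0, 1), (1, 2)])       # a path
cyclic = spanning_tree_count(3, [(0, 1), (1, 2), (0, 2)])  # a triangle
```

A cascade that is itself a tree has exactly one spanning tree, so a count above one indicates that the cascade contains at least one cycle.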

4.2.3 Connectivity Properties 

The clustering coefficient of a vertex $v_i$ is denoted by $C_i$
and is defined as the ratio of the number of existing edges
among $v_i$'s neighbors and the number of all possible
edges among them [2]. Using $\Delta_i$ to denote the number of
triangles containing vertex $v_i$ and $d_i$ to denote the degree of
vertex $v_i$, the clustering coefficient of vertex $v_i$ is defined as:

$$C_i = \frac{2\Delta_i}{d_i(d_i - 1)}.$$

The average clustering coefficient of a graph $G$ with $n$ nodes
is simply the mean of the clustering coefficients of its individual
nodes. Figure 6(e) shows the CDF of the average clustering
coefficient for all cascades in our data set. We note that
approximately 86% of all cascades in our data set have an average
clustering coefficient equal to 0, i.e., they do not have
a single triangle. Only a small fraction (less than 2%) of
cascades in our data set have clustering coefficient values
greater than 0.5, which again highlights their sparsity.
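
The clustering coefficient can be computed directly from neighbor sets, as in the sketch below. It assumes an undirected graph stored as a dictionary of neighbor sets and adopts the common convention, not stated in the text, that a vertex with fewer than two neighbors has a coefficient of 0.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    neighbors = adj[v]
    degree = len(neighbors)
    if degree < 2:
        return 0.0  # convention assumed here for degree 0 or 1
    triangles = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2 * triangles / (degree * (degree - 1))

def average_clustering(adj):
    """Mean of the per-vertex clustering coefficients."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
```

A star-shaped cascade, where all re-tweeters are connected only to the root, has an average coefficient of 0, which is the case for most cascades in the data set.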

We are also interested in investigating the sizes of cliques
in cascades that have one or more triangles. Towards this
end, we study the clique numbers of all cascade graphs in
our data set. The clique number of a graph is the number
of vertices in its largest clique. Figure 6(f) shows the
distribution of the clique number for all cascades in our data
set. Similar to our observation in Figure 6(e), we observe
that approximately 86% of cascades have a clique number
of 2, which means that they do not have a triangle. A little
more than 10% of cascades have at least one triangle. The
largest clique number observed in our data set is 6.
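
For small graphs such as these cascades, the clique number can be found by enumerating maximal cliques. The sketch below uses the basic Bron-Kerbosch recursion (without pivoting), which is our choice for illustration and not necessarily the method used in the study.

```python
def clique_number(adj):
    """Size of the largest clique in an undirected graph.

    adj: dict mapping each node to the set of its neighbors.
    Uses the basic Bron-Kerbosch enumeration of maximal cliques.
    """
    best = 0

    def expand(clique_size, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            best = max(best, clique_size)  # a maximal clique was reached
            return
        for v in list(candidates):
            expand(clique_size + 1, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(0, set(adj), set())
    return best

single_edge = {"a": {"b"}, "b": {"a"}}
tri_pendant = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
```

A clique number of 2 means the largest fully connected group is a single edge, so the graph has no triangle.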

5. CASCADE SIZE PREDICTION 

To demonstrate the effectiveness of our $M^4C$ model in
quantitatively characterizing cascades, we use it to investigate
an unexplored but fundamental problem in online social
networks, cascade size prediction: given the first $\tau_1$ edges in
a cascade, we want to predict whether the cascade will have
a total of at least $\tau_2$ ($\tau_2 > \tau_1$) edges over its lifetime.
Besides validating the relevance of our $M^4C$ model, this
prediction has many real-world applications. For instance, it is
useful for media organizations to forecast popular news
stories [15]. Likewise, popular videos on social media, if
predicted early, can be cached by content distribution networks
at their servers to achieve better performance [29]. Furthermore,
solving this problem enables the early detection of epidemic
outbreaks and political crises.

To the best of our knowledge, this problem has not been
investigated in prior literature. The closest effort is by
Galuba et al., who analyzed the cascades of URLs on Twitter
to predict URLs that users will tweet [12]. Their proposed
approach achieved about a 50% true positive rate with about
a 15% false positive rate. Unfortunately, this accuracy is of
limited use in practice.

We compare the prediction performance of the $M^4C$ based
scheme with a baseline scheme that uses the following 8 cascade
graph features with a Naive Bayes classifier: (1) edge
growth rate, (2) number of nodes, (3) degree of the root
node, (4) average shortest path length, (5) diameter, (6)
number of spanning trees, (7) clustering coefficient, and
(8) clique number. We evaluate the effectiveness of these
schemes in terms of the following decision sets.

1. True Positives (TPs): The set of cascades that are correctly
predicted to have a total of at least $\tau_2$ edges over
their lifetime.

2. False Positives (FPs): The set of cascades that are
incorrectly predicted to have a total of at least $\tau_2$ edges
over their lifetime.

3. True Negatives (TNs): The set of cascades that are
correctly predicted to have a total of less than $\tau_2$ edges
over their lifetime.

4. False Negatives (FNs): The set of cascades that are
incorrectly predicted to have a total of less than $\tau_2$
edges over their lifetime.
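
The detection rate, false positive rate, and precision reported in the evaluation can be derived from the sizes of these four decision sets. The sketch below assumes the standard definitions of these metrics; the function names are illustrative.

```python
def rate(numerator, denominator):
    """Safe ratio that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, tn, fn):
    """Return (detection rate, false positive rate, precision).

    tp, fp, tn, fn: sizes of the four decision sets listed above.
    """
    detection_rate = rate(tp, tp + fn)       # true positive rate
    false_positive_rate = rate(fp, fp + tn)
    precision = rate(tp, tp + fp)
    return detection_rate, false_positive_rate, precision

# Illustrative counts only: 100 large and 100 small cascades.
example = classification_metrics(tp=50, fp=15, tn=85, fn=50)
```

With balanced classes, as in this toy example, a 50% detection rate at a 15% false positive rate corresponds to a precision of about 77%.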

We further quantify the effectiveness of both cascade size
prediction schemes in terms of the following three Receiver
Operating Characteristic (ROC) metrics: detection rate, false
positive rate, and precision.

Figure 7: Classification results of $M^4C$ and baseline schemes for varying values of $\tau_1$, at $\tau_2 - \tau_1 = 10$.

Figure 8: Evaluation setup for varying $\tau_1$.

To ensure that the classification results are generalizable,
we divide the data set into $k$ folds and use $k - 1$ of them for
training and the remaining one for testing. We repeat these
experiments $k$ times and report the average results in the
following text. This setup is called the stratified $k$-fold
cross-validation procedure [35]. For all experimental results
reported in this paper, we use $k = 10$.
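
The cross-validation procedure can be sketched as follows. The fold assignment (shuffling each class and dealing its instances round-robin) is one simple way to obtain stratification and is our assumption; `evaluate` is a hypothetical callback that trains on the given training indices and returns a score on the test indices.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split instance indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal round-robin
    return folds

def cross_validate(labels, evaluate, k=10):
    """Average the score of `evaluate(train_idx, test_idx)` over k folds."""
    folds = stratified_folds(labels, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

labels = [0] * 50 + [1] * 50
folds = stratified_folds(labels, k=10)
```

Each instance is used for testing exactly once, and every fold keeps the class ratio of the full data set.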

In this paper, we treat the cascade size prediction problem
as an equivalent cascade classification problem: given a
cascade with $\tau_1$ edges, classify it into one of two classes: the
class of cascades that will have fewer than $\tau_2$ edges over their
lifetime and the class of cascades that will have greater than or
equal to $\tau_2$ edges over their lifetime. We use the initial $\tau_1$
edges to train both the cascade size prediction scheme based on our
$M^4C$ model and the baseline scheme that is based on the
known cascade graph features. For thorough evaluation, we
vary the values of $\tau_1$ and $\tau_2$. Because the distribution of the
number of edges in our data set is skewed, that is, most cascades
have only a few edges over their lifetime, the larger
the values of $\tau_1$ and $\tau_2 - \tau_1$ are, the more imbalanced the
two classes are. To mitigate the potential adverse effect of
class imbalance [18], we employ instance re-sampling to ensure
that both classes have an equal number of instances before
the cross-validation evaluations. Below we discuss the
classification accuracies of both schemes as we vary the values
of $\tau_1$ and $\tau_2$.

Figure 9: ROC plot of $M^4C$ based scheme for varying $\tau_1$.
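
The labeling and re-sampling steps can be sketched as follows. The text does not specify the re-sampling method, so this sketch assumes random under-sampling of the larger class, and it assumes that cascades with fewer than $\tau_1$ edges are excluded because their first $\tau_1$ edges cannot be observed. The function names are hypothetical.

```python
import random

def label_cascades(final_sizes, tau1, tau2):
    """Label each eligible cascade by whether it reaches tau2 edges.

    final_sizes: total number of edges of each cascade over its lifetime.
    Returns (cascade index, label) pairs; label 1 means size >= tau2.
    Assumption: cascades with fewer than tau1 edges are skipped.
    """
    return [(i, int(size >= tau2))
            for i, size in enumerate(final_sizes) if size >= tau1]

def balance_by_undersampling(labelled, seed=0):
    """Randomly drop instances of the larger class until both are equal."""
    rng = random.Random(seed)
    positives = [item for item in labelled if item[1] == 1]
    negatives = [item for item in labelled if item[1] == 0]
    m = min(len(positives), len(negatives))
    return rng.sample(positives, m) + rng.sample(negatives, m)

sizes = [3, 10, 12, 15, 19, 20, 25, 40, 11, 13]
labelled = label_cascades(sizes, tau1=10, tau2=20)
balanced = balance_by_undersampling(labelled)
```

Over-sampling the smaller class would be an equally valid reading of "instance re-sampling"; under-sampling is shown only because it is the simplest to state.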

5.1 Impact of Varying $\tau_1$

Figure 8 shows the evaluation setup as we vary the values
of $\tau_1 \in \{10, 50, 100\}$, while keeping $\tau_2 - \tau_1$ fixed at 10. The
solid, dashed, and dotted vertical black lines correspond to
$\tau_1 = 10$, 50, and 100, respectively. The solid, dashed, and dotted
vertical grey lines all correspond to $\tau_2 - \tau_1 = 10$. The value of
$\tau_1$ impacts the classification results because it determines the
number of edges in each cascade that are available for training.
Therefore, larger values of $\tau_1$ generally improve the training
quality of both cascade size prediction schemes and lead to
better prediction accuracy.

Figure 7 plots the detection rate, false positive rate,
and precision of the $M^4C$ and baseline schemes for varying
$\tau_1 \in \{10, 50, 100\}$, while keeping $\tau_2 - \tau_1$ fixed at 10. Overall,
we observe that $M^4C$ consistently outperforms the baseline
scheme, with a peak precision of 96% at $\tau_1 = 100$, $\tau_2 - \tau_1 = 10$.
With some exceptions, we generally observe that the
effectiveness of both schemes decreases as the value of $\tau_1$ is
increased. The standard ROC threshold plots of $M^4C$ shown
in Figure 9 also confirm this observation.

Figure 10: Evaluation setup for varying $\tau_2 - \tau_1$.

Figure 11: Classification results of $M^4C$ and baseline schemes for varying values of $\tau_2 - \tau_1$, at $\tau_1 = 10$.

5.2 Impact of Varying $\tau_2 - \tau_1$



Figure 10 shows the evaluation setup as we vary the values
of $\tau_2 - \tau_1 \in \{10, 50, 100\}$, while keeping $\tau_1$ fixed at 10. The
solid vertical black line corresponds to $\tau_1 = 10$. The solid,
dashed, and dotted vertical grey lines correspond to $\tau_2 - \tau_1 =$
10, 50, and 100, respectively. The value of $\tau_2 - \tau_1$ also impacts
the classification results because it determines the separation
or distance between the two classes. Therefore, larger values
of $\tau_2 - \tau_1$ generally lead to better prediction accuracy.

Figure 11 plots the detection rate, false positive rate, and
precision of the $M^4C$ and baseline schemes for varying values
of $\tau_2 - \tau_1$. Once again, we observe that $M^4C$ consistently
outperforms the baseline scheme, with a peak precision of 99%
at $\tau_2 - \tau_1 = 100$, $\tau_1 = 10$. We also observe that the
classification performance of both methods improves as the value
of $\tau_2 - \tau_1$ is increased. The standard ROC threshold plots
of $M^4C$ shown in Figure 12 also confirm this observation.



6. CONCLUSIONS AND FUTURE WORK 

In this paper, we first propose $M^4C$, a multi-order Markov
chain based model to represent and quantitatively characterize
the morphology of cascades with arbitrary structures,
shapes, and sizes. We then demonstrate the relevance of our
$M^4C$ model in solving the cascade size prediction problem.
The experimental results using a real-world Twitter data
set show that $M^4C$ significantly outperforms the baseline
scheme in terms of prediction accuracy. In summary,
our $M^4C$ model allows us to formally and rigorously study
cascade morphology, which is otherwise difficult.

Figure 12: ROC plot of $M^4C$ based scheme for varying $\tau_2 - \tau_1$.

In this paper, we applied our $M^4C$ model in the context of
online social networks; however, our model is generally
applicable to cascades in other contexts as well, such as sociology,
economy, psychology, political science, marketing, and
epidemiology. Applications of our model in these contexts are
interesting future work to pursue.